Hi-Fi: Hierarchical Feature Integration for Skeleton Detection
In natural images, the scales (thickness) of object skeletons may
dramatically vary among objects and object parts, making object skeleton
detection a challenging problem. We present a new convolutional neural network
(CNN) architecture by introducing a novel hierarchical feature integration
mechanism, named Hi-Fi, to address the skeleton detection problem. The proposed
CNN-based approach has a powerful multi-scale feature integration ability that
intrinsically captures high-level semantics from deeper layers as well as
low-level details from shallower layers. By hierarchically integrating
different CNN feature levels with bidirectional guidance, our approach (1)
enables mutual refinement across features of different levels, and (2)
possesses the strong ability to capture both rich object context and
high-resolution details. Experimental results show that our method
significantly outperforms the state-of-the-art methods in terms of effectively
fusing features from very different scales, as evidenced by a considerable
performance improvement on several benchmarks.
Comment: IJCAI201
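The abstract above describes bidirectional guidance between feature levels without giving implementation details. As a purely illustrative aid, here is a minimal PyTorch sketch of one way mutual refinement between a shallow (high-resolution) and a deep (low-resolution) feature map could be wired up; the module, channel sizes, and operations are assumptions and not the actual Hi-Fi architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

class BidirectionalFusion(nn.Module):
    # Hypothetical stand-in for bidirectional feature integration:
    # deep semantics guide the shallow map, shallow details guide the deep map.
    def __init__(self, shallow_ch, deep_ch, mid_ch):
        super().__init__()
        self.reduce_shallow = nn.Conv2d(shallow_ch, mid_ch, kernel_size=1)
        self.reduce_deep = nn.Conv2d(deep_ch, mid_ch, kernel_size=1)
        self.refine_shallow = nn.Conv2d(mid_ch, mid_ch, kernel_size=3, padding=1)
        self.refine_deep = nn.Conv2d(mid_ch, mid_ch, kernel_size=3, padding=1)

    def forward(self, shallow, deep):
        s = self.reduce_shallow(shallow)
        d = self.reduce_deep(deep)
        # top-down guidance: upsample deep features and refine the shallow map
        d_up = F.interpolate(d, size=s.shape[-2:], mode="bilinear", align_corners=False)
        s_out = self.refine_shallow(s + d_up)
        # bottom-up guidance: downsample shallow features and refine the deep map
        s_down = F.adaptive_avg_pool2d(s, d.shape[-2:])
        d_out = self.refine_deep(d + s_down)
        return s_out, d_out

# Example: mutually refine two adjacent backbone stages (sizes are made up).
fuse = BidirectionalFusion(shallow_ch=256, deep_ch=512, mid_ch=128)
c3, c4 = torch.randn(1, 256, 80, 80), torch.randn(1, 512, 40, 40)
f3, f4 = fuse(c3, c4)

Applied across several adjacent levels, the refined high-resolution features would then feed a per-pixel skeleton prediction head.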
ChatAnything: Facetime Chat with LLM-Enhanced Personas
In this technical report, we target generating anthropomorphized personas for
LLM-based characters in an online manner, including visual appearance,
personality and tones, with only text descriptions. To achieve this, we first
leverage the in-context learning capability of LLMs for personality generation
by carefully designing a set of system prompts. We then propose two novel
concepts: the mixture of voices (MoV) and the mixture of diffusers (MoD) for
diverse voice and appearance generation. For MoV, we utilize text-to-speech
(TTS) algorithms with a variety of pre-defined tones and automatically select
the one that best matches the user-provided text description. For
MoD, we combine the recent popular text-to-image generation techniques and
talking head algorithms to streamline the process of generating talking
objects. We term the whole framework ChatAnything. With it, users can animate
anything with anthropomorphic personas using just a few text inputs. However,
we have observed that the anthropomorphic objects produced by current
generative models are often undetectable by pre-trained face landmark
detectors, leading to failure of the face motion generation even when these
faces possess human-like appearances, because such images are rarely seen
during training (i.e., they are OOD samples). To address this issue, we
incorporate pixel-level guidance to infuse human face landmarks during the
image generation phase. To benchmark these metrics, we have built an evaluation
dataset. On it, we verify that the face landmark detection rate increases
significantly from 57.0% to 92.5%, thus allowing automatic face animation
driven by generated speech content. The code and more results can be
found at https://chatanything.github.io/
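The report states that MoV automatically selects the best-matching pre-defined TTS tone from the user's text description, but the matching rule is not spelled out above. The Python sketch below uses sentence-embedding similarity as one plausible stand-in; the tone catalogue, model name, and helper function are hypothetical and not taken from the ChatAnything code.

import numpy as np
from sentence_transformers import SentenceTransformer

# Hypothetical catalogue of pre-defined voice tones (the real tone set is not given here).
VOICE_TONES = {
    "cheerful_child": "a bright, energetic, high-pitched voice",
    "calm_narrator": "a slow, deep, soothing narration voice",
    "elderly_wizard": "an old, raspy, mysterious voice",
}

def select_voice(persona_description, model_name="all-MiniLM-L6-v2"):
    # Pick the pre-defined tone whose description is closest to the user's text.
    model = SentenceTransformer(model_name)
    query = model.encode(persona_description, normalize_embeddings=True)
    names = list(VOICE_TONES)
    tone_embs = model.encode([VOICE_TONES[n] for n in names], normalize_embeddings=True)
    scores = tone_embs @ query          # cosine similarity (embeddings are normalized)
    return names[int(np.argmax(scores))]

print(select_voice("a wise old tree spirit who speaks slowly"))

The selected tone name would then be handed to whichever TTS backend hosts the pre-defined voices.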
Large-scale Unsupervised Semantic Segmentation
Empowered by large datasets, e.g., ImageNet, unsupervised learning on
large-scale data has enabled significant advances for classification tasks.
However, whether large-scale unsupervised semantic segmentation can be
achieved remains unknown. There are two major challenges: i) we need a
large-scale benchmark for assessing algorithms; ii) we need to develop methods
to simultaneously learn category and shape representation in an unsupervised
manner. In this work, we propose a new problem of large-scale unsupervised
semantic segmentation (LUSS) with a newly created benchmark dataset to
facilitate research progress. Building on the ImageNet dataset, we propose the ImageNet-S
dataset with 1.2 million training images and 50k high-quality semantic
segmentation annotations for evaluation. Our benchmark has a high data
diversity and a clear task objective. We also present a simple yet effective
method that works surprisingly well for LUSS. In addition, we benchmark related
un/weakly/fully supervised methods accordingly, identifying the challenges and
possible directions of LUSS. The benchmark and source code are publicly
available at https://github.com/LUSSeg.
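The benchmark's official evaluation protocol is in the linked repository and is not detailed in the abstract. As a generic illustration of how unsupervised segmentation is commonly scored, the sketch below (a hypothetical helper, not the benchmark's code) maps predicted cluster IDs to ground-truth classes via Hungarian matching and then computes mean IoU.

import numpy as np
from scipy.optimize import linear_sum_assignment

def match_and_miou(pred, gt, n_clusters, n_classes):
    # pred, gt: flat integer label arrays over all evaluated pixels.
    # Build the cluster-vs-class confusion matrix.
    conf = np.zeros((n_clusters, n_classes), dtype=np.int64)
    np.add.at(conf, (pred, gt), 1)
    # Hungarian matching: assign each cluster to the class it overlaps most.
    rows, cols = linear_sum_assignment(-conf)
    mapping = np.full(n_clusters, -1)
    mapping[rows] = cols
    remapped = mapping[pred]
    # Mean IoU over ground-truth classes after remapping.
    ious = []
    for cls in range(n_classes):
        inter = np.sum((remapped == cls) & (gt == cls))
        union = np.sum((remapped == cls) | (gt == cls))
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))

pred = np.random.randint(0, 50, size=10000)   # toy cluster assignments
gt = np.random.randint(0, 20, size=10000)     # toy ground-truth classes
print(match_and_miou(pred, gt, n_clusters=50, n_classes=20))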
A Relax Inexact Accelerated Proximal Gradient Method for the Constrained Minimization Problem of Maximum Eigenvalue Functions
For the constrained minimization problem of maximum eigenvalue functions, the objective function is nonsmooth, so the approximate inexact accelerated proximal gradient (AIAPG) method (Wang et al., 2013) can be used to solve its smooth approximation. We consider the problem min{λmax(X)+g(X):X∈Sn}, where λmax(X) is the maximum eigenvalue function and g(X) is a proper lower semicontinuous convex (possibly nonsmooth) function, and take g(X)=δΩ(X), the indicator function of Ω:={X∈Sn:F(X)=b,X⪰0}. The approximate minimizer generated by the AIAPG method must be contained in Ω, otherwise the method is invalid. In this paper, we consider the case where the approximate minimizer cannot be guaranteed to lie in Ω, and we propose two strategies: constructing a feasible solution, and designing a new method named the relax inexact accelerated proximal gradient (RIAPG) method. Compared with the former strategy, the latter overcomes the drawback that the required conditions are too strict. Furthermore, the RIAPG method inherits the global iteration complexity and attractive computational advantage of the AIAPG method.
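To make the kind of iteration concrete (this is not the AIAPG or RIAPG method itself), the numpy sketch below runs a FISTA-style accelerated proximal gradient on a log-sum-exp smoothing of λmax, with projection onto the PSD cone standing in for the projection onto Ω; the smoothing, the simplified feasible set, and all function names are assumptions made for illustration.

import numpy as np

def smoothed_lambda_max(X, mu):
    # f_mu(X) = mu * log(sum_i exp(lambda_i(X) / mu)), a smooth upper bound on lambda_max.
    lam, V = np.linalg.eigh(X)
    shifted = (lam - lam.max()) / mu
    w = np.exp(shifted) / np.exp(shifted).sum()      # softmax weights over eigenvalues
    val = lam.max() + mu * np.log(np.exp(shifted).sum())
    grad = (V * w) @ V.T                             # V diag(w) V^T
    return val, grad

def proj_psd(X):
    # Projection onto the PSD cone (a simplified stand-in for projecting onto Omega).
    lam, V = np.linalg.eigh((X + X.T) / 2)
    return (V * np.clip(lam, 0.0, None)) @ V.T

def accelerated_proximal_gradient(X0, mu=0.1, n_iter=200):
    # FISTA-style iteration for min f_mu(X) + indicator of the PSD cone.
    step = mu                     # grad of f_mu is (1/mu)-Lipschitz, so step = mu is safe
    X = Y = X0.copy()
    t = 1.0
    for _ in range(n_iter):
        _, g = smoothed_lambda_max(Y, mu)
        X_new = proj_psd(Y - step * g)
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        Y = X_new + ((t - 1.0) / t_new) * (X_new - X)
        X, t = X_new, t_new
    return X

X = accelerated_proximal_gradient(np.eye(5))
print(smoothed_lambda_max(X, 0.1)[0])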
Exploring Feature Self-relation for Self-supervised Transformer
Learning representations with self-supervision for convolutional networks
(CNNs) has proven effective for vision tasks. As an alternative to CNNs, vision
transformers (ViTs) exhibit strong representation ability through pixel-level
self-attention and channel-level feed-forward networks. Recent works reveal
that self-supervised learning helps unleash the great potential of ViTs. Still,
most works follow self-supervised strategies designed for CNNs, e.g.,
instance-level discrimination of samples, but they ignore the unique properties
of ViTs. We observe that modeling relations among pixels and channels
distinguishes ViTs from other networks. To enforce this property, we explore
the feature self-relations for training self-supervised ViTs. Specifically,
instead of conducting self-supervised learning solely on feature embeddings
from multiple views, we utilize the feature self-relations, i.e.,
pixel/channel-level self-relations, for self-supervised learning. Self-relation
based learning further enhances the relation modeling ability of ViTs, resulting
in strong representations that stably improve performance on multiple
downstream tasks. Our source code will be made publicly available.
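Since the abstract describes pixel- and channel-level self-relations without giving the loss, the following PyTorch sketch is one hedged interpretation: softmax-normalized token-token and channel-channel similarity matrices from two views are aligned with a KL term, assuming the views are spatially aligned (e.g., photometric augmentation only); the temperature, objective, and names are assumptions rather than the paper's released code.

import torch
import torch.nn.functional as F

def pixel_self_relation(feats, tau=0.1):
    # Token-token similarity, softmax-normalized per row. feats: (B, N, C).
    f = F.normalize(feats, dim=-1)
    return (torch.einsum("bnc,bmc->bnm", f, f) / tau).softmax(dim=-1)

def channel_self_relation(feats, tau=0.1):
    # Channel-channel similarity, softmax-normalized per row. feats: (B, N, C).
    f = F.normalize(feats, dim=1)
    return (torch.einsum("bnc,bnd->bcd", f, f) / tau).softmax(dim=-1)

def self_relation_loss(student_feats, teacher_feats):
    # Align the self-relations of two augmented views; the teacher branch is detached.
    with torch.no_grad():
        t_pix = pixel_self_relation(teacher_feats)
        t_ch = channel_self_relation(teacher_feats)
    s_pix = pixel_self_relation(student_feats)
    s_ch = channel_self_relation(student_feats)
    kl = lambda p, t: F.kl_div(p.clamp_min(1e-8).log(), t, reduction="batchmean")
    return kl(s_pix, t_pix) + kl(s_ch, t_ch)

student = torch.randn(2, 196, 384)   # toy ViT patch tokens from view 1
teacher = torch.randn(2, 196, 384)   # toy ViT patch tokens from view 2
print(self_relation_loss(student, teacher))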